Imagine you are managing an online gift shop based in the UK. Every day, hundreds of customers make purchases, and over the course of a year, there are more than 500,000 transactions. Naturally, as a business manager, you would want to know:
To answer these questions, I analyzed the Online Retail dataset from the UCI Machine Learning Repository, which contains 542,000 transactions from December 2010 to December 2011. This dataset mainly records purchases of unique, all-occasion gifts, often by wholesale customers.
In this project, I act as a data detective, exploring transactions, segmenting customers, predicting future high-value buyers, and uncovering products that are frequently bought together. The goal is to turn raw data into actionable insights that can guide marketing, sales, and customer retention strategies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
Before analyzing, it’s important to make sure the data is clean. Missing customer IDs, negative quantities, or incorrect prices could distort insights.
df = pd.read_excel("Online Retail.xlsx")
# Remove missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]
# Create TotalPrice
df["TotalPrice"] = df["Quantity"]*df["UnitPrice"]
# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["InvoiceMonth"] = df["InvoiceDate"].dt.to_period("M").astype(str)
# Get latest transaction date
snapshot_date = df["InvoiceDate"].max()
print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 10)
After cleaning, the dataset contains roughly 398,000 valid transactions across 10 columns, ready for analysis.
df.head()
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice | InvoiceMonth | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 | 2010-12 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 | 2010-12 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 | 2010-12 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 | 2010-12 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 | 2010-12 |
First, I asked: Which products generate the most revenue?
top_products = df.groupby("Description")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(top_products[::-1], x=top_products.values[::-1], y=top_products.index[::-1],
orientation='h', title="Top 10 Products by Revenue").show()
We notice that a few items, such as Paper Craft Little Birdie and Regency Cakestand 3 Tier, contribute disproportionately to total revenue. These “star products” are critical to the business.
Strategy: A small number of products dominate revenue, so focusing marketing efforts on these items, through promotions, bundles, or premium packages, can significantly boost revenue.
Next, I explored where most sales come from?
top_countries = df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False).head(10)
px.bar(top_countries[::-1], x=top_countries.values[::-1], y=top_countries.index[::-1],
orientation='h', title="Top 10 Countries by Revenue").show()
The UK dominates revenue, while other countries contribute less.
Strategy: The UK is the core market, so marketing and logistics should prioritize it. International markets, though smaller, offer opportunities for testing new products or localized campaigns.
We examine seasonality by looking at revenue trends throughout the year:
monthly_sales = df.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
title="Monthly Sales Trend (Full Year)", markers=True).show()
Looking at sales across months, we notice clear ups and downs. Sales reached a peak in August, likely boosted by summer shopping trends or special promotions. However, November showed a noticeable dip, which could be the result of stock shortages or simply weaker customer demand.
Strategy: Prepare promotions and inventory ahead of busy months, and consider targeted campaigns or discounts during slower months to maintain steady revenue.
Not all customers return after a purchase. Some high-value buyers who shopped frequently may have stopped, which can hurt revenue.
Using RFM analysis, we capture customer behavior:
# RFM calculation
rfm = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})
# Scaling
from sklearn.preprocessing import StandardScaler
rfm_scaled = StandardScaler().fit_transform(rfm)
By applying K-Means clustering, customers were automatically grouped based on their buying patterns.
from sklearn.cluster import KMeans
inertia = [KMeans(n_clusters=k, random_state=42).fit(rfm_scaled).inertia_ for k in range(1,8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()
The Elbow Method suggests that 3 clusters is optimal.
# Train K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)
# Segment Profile
print("RFM Segment Profile:\n", rfm.groupby("Segment").mean().round(2))
RFM Segment Profile:
Recency Frequency Monetary
Segment
0 39.98 4.85 2012.11
1 245.02 1.58 631.14
2 6.14 80.21 122888.41
Cluster Profiles:
“At-risk” customers are those who haven’t purchased in a long time—the top 25% in Recency. Visualizing them in a Recency vs Monetary scatter plot helps identify who might churn:
rfm["AtRisk"] = rfm["Recency"] > rfm["Recency"].quantile(0.75)
rfm["AtRisk_Label"] = rfm["AtRisk"].map({True:"At-Risk", False:"Active"})
fig = px.scatter(rfm, x="Recency", y="Monetary", color="AtRisk_Label", size="Frequency",
hover_data=["Recency","Frequency","Monetary"], title="Recency vs Monetary")
fig.show()
Think of them as friends who haven’t visited in a while. Without engagement, they may never return, representing potential lost revenue.
Customer Dynamics:
Strategy:
The next question: Which customers are likely to spend the most in the future?
High-value customers are the top 25% of spenders. Using patterns from past transactions—recency, frequency, average spend, and country—we trained a Random Forest Classifier to predict future high-value buyers.
# Define target
y = (rfm["Monetary"] > rfm["Monetary"].quantile(0.75)).astype(int)
# Features
X = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"UnitPrice": "mean",
"Quantity": "mean",
"Country": lambda x: x.mode()[0]})
X = pd.get_dummies(X, columns=["Country"], drop_first=True)
# Scaling
X_scaled = StandardScaler().fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
random_state=42, stratify=y)
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)
# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int) # threshold 0.4 to capture more high-value customers
from sklearn.metrics import classification_report, roc_auc_score
print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943
precision recall f1-score support
0 0.92 0.94 0.93 651
1 0.81 0.75 0.78 217
accuracy 0.89 868
macro avg 0.87 0.84 0.85 868
weighted avg 0.89 0.89 0.89 868
Results:
The model is reliable. We can now proactively engage customers who are most likely to contribute significant revenue, instead of waiting for them to make purchases on their own.
Strategy:
Another way to boost revenue is understanding which products are often bought together. If certain items are frequently purchased as a pair, the business can create bundles or encourage cross-selling.
We analyzed transactions from UK customers using FP-Growth, a method to detect frequently co-purchased items.
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Create basket: UK transactions pivoted by InvoiceNo and Description
basket = (df[df['Country'] == "United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().fillna(0))
# Convert quantities to boolean
basket = basket.applymap(lambda x: x > 0)
# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
.sort_values('lift', ascending=False) \
.head(5)
print("Top 5 Market Basket Rules:\n", rules[['antecedents','consequents','support','confidence','lift']])
Top 5 Market Basket Rules:
antecedents consequents support \
541 (HERB MARKER ROSEMARY) (HERB MARKER THYME) 0.010153
540 (HERB MARKER THYME) (HERB MARKER ROSEMARY) 0.010153
914 (REGENCY TEA PLATE GREEN ) (REGENCY TEA PLATE ROSES ) 0.011534
915 (REGENCY TEA PLATE ROSES ) (REGENCY TEA PLATE GREEN ) 0.011534
617 (POPPY'S PLAYHOUSE BEDROOM ) (POPPY'S PLAYHOUSE LIVINGROOM ) 0.010153
confidence lift
541 0.933702 86.829038
540 0.944134 86.829038
914 0.845815 52.930211
915 0.721805 52.930211
617 0.650000 51.769856
For example, Herb Marker Rosemary and Herb Marker Thyme are almost always bought together, while items in the Regency Tea Plate set are commonly purchased as a pair. Similarly, Poppy’s Playhouse Bedroom and Livingroom items are frequently bought together. These patterns reveal natural associations in customer purchases, which can help the business suggest complementary items, increase the average order value, and enhance the overall shopping experience.
Strategy
From nearly 398,000 transactions, several important insights emerged. Revenue is concentrated on a few best-selling products and is strongly driven by the UK market. Customer segmentation revealed distinct groups—VIPs, mid-tier buyers, and at-risk customers—allowing for more focused engagement strategies. Predictive modeling made it possible to identify future high-value customers before they spend, enabling proactive retention. Meanwhile, market basket analysis highlighted natural product pairings, opening opportunities for bundling and cross-selling.
Together, these insights provide a roadmap for making data-driven decisions: strengthening loyalty among top customers, reactivating those at risk, and boosting sales through smarter product strategies. The result is not just higher revenue, but also sustainable growth in the competitive e-commerce landscape.